Google Web 1T 5-Grams Made Easy (but not for the computer)
Abstract
This paper introduces Web1T5-Easy, a simple indexing solution that allows interactive searches of the Web 1T 5-gram database and of a derived database of quasi-collocations. The latter is validated against co-occurrence data from the BNC and ukWaC on the task of automatically identifying non-compositional verb-particle constructions (VPCs).
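The abstract does not spell out the indexing scheme, but the core idea of making a static n-gram table interactively searchable can be sketched with a relational index. The schema, table names, and sample rows below are illustrative assumptions, not the actual Web1T5-Easy layout:

```python
import sqlite3

# Illustrative sketch: load 5-grams with frequencies into an SQLite
# table and index the leading tokens, so that prefix-style pattern
# queries return at interactive speed. Schema and data are made up.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE grams (w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT, w5 TEXT, f INTEGER)"
)
rows = [
    ("new", "york", "city", "is", "big", 120),
    ("new", "york", "times", "reported", "that", 300),
]
con.executemany("INSERT INTO grams VALUES (?,?,?,?,?,?)", rows)
con.execute("CREATE INDEX idx_w1_w2 ON grams (w1, w2)")

# Query: what follows "new york", most frequent first?
hits = con.execute(
    "SELECT w3, f FROM grams WHERE w1='new' AND w2='york' ORDER BY f DESC"
).fetchall()
# → [('times', 300), ('city', 120)]
```

The index on the leading tokens is what turns a full scan of billions of rows into a cheap range lookup for queries with a fixed prefix.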
Similar resources
Analysis of Czech Web 1T 5-Gram Corpus and Its Comparison with Czech National Corpus Data
In this paper, the newly issued Czech Web 1T 5-gram corpus created by Google and LDC is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models with various vocabulary sizes and parameters were created. The comparison of various corpus statistics such as unique and total wor...
Real-Word Spelling Correction using Google Web 1T 3-grams
We present a method for detecting and correcting multiple real-word spelling errors using the Google Web 1T 3-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method focuses mainly on improving the detection recall (the fraction of errors correctly detected) and the correction recall (the fraction of errors correc...
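The abstract's "normalized LCS" can be sketched as the classic dynamic-programming LCS length divided by the length of the longer string; the exact normalization and modifications in the paper are not given here, so this is only one common variant:

```python
def lcs_length(a: str, b: str) -> int:
    # Classic dynamic-programming longest common subsequence length.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def normalized_lcs(a: str, b: str) -> float:
    # One common normalization: LCS length relative to the longer string,
    # so the score lies in [0, 1].
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

sim = normalized_lcs("peace", "piece")  # LCS "pece" has length 4 → 0.8
```

A spelling corrector can use such a score to rank candidate replacements that are string-similar to the observed word, before checking them against 3-gram counts.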
Unsupervised Approaches to Text Correction Using Google N-grams for English and Romanian
We present an unsupervised approach that can be applied to text correction tasks such as real-word error correction, near-synonym choice, and preposition choice, using n-grams from the Google Web 1T dataset. We present in detail the method for correcting preposition errors, which has two phases. We categorize the n-gram types based on the position of the gap that needs to be replaced with a p...
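The core unsupervised step described above can be sketched as follows: fill the gap with each candidate preposition and keep the filler whose resulting n-gram has the highest Web-scale count. The function and sample counts below are illustrative, not the paper's actual scoring formula:

```python
def choose_preposition(counts, context_left, context_right, candidates):
    # Sketch of the unsupervised idea: substitute each candidate into
    # the gap and score it by the count of the resulting n-gram in a
    # Web 1T-style frequency table; the highest-count filler wins.
    def score(p):
        ngram = f"{context_left} {p} {context_right}"
        return counts.get(ngram, 0)
    return max(candidates, key=score)

# Toy frequency table standing in for Google Web 1T counts.
counts = {"interested in politics": 5000, "interested on politics": 40}
best = choose_preposition(counts, "interested", "politics", ["in", "on", "at"])
# → "in"
```

In practice the method would combine evidence from several n-gram types (different gap positions and lengths) rather than a single 3-gram, which is what the categorization by gap position supports.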
Ngram Search Engine
In this paper, we describe an idea and its implementation for an n-gram search engine for very large sets of n-grams. The engine supports queries with an arbitrary number of wildcards. A search takes a fraction of a second and can provide the fillers of the wildcards. We implemented the system using two datasets. One is the 1 billion 5-grams provided by Google (Web 1T data), the othe...
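The abstract does not specify the engine's data structures, but the essential trick of wildcard queries over a sorted n-gram list can be sketched with a binary search on the fixed prefix plus a regular expression that captures the wildcard fillers. All names here are illustrative:

```python
import bisect
import re

def wildcard_search(ngrams_sorted, pattern, wildcard="*"):
    # ngrams_sorted: lexicographically sorted space-separated n-grams.
    # pattern: tokens, where each wildcard matches exactly one token.
    tokens = pattern.split()
    # The fixed prefix before the first wildcard narrows the scan range.
    prefix_tokens = []
    for t in tokens:
        if t == wildcard:
            break
        prefix_tokens.append(t)
    prefix = " ".join(prefix_tokens)
    lo = bisect.bisect_left(ngrams_sorted, prefix)
    # A regex that captures the filler of each wildcard position.
    rx = re.compile(
        "^" + r"\s".join(
            r"(\S+)" if t == wildcard else re.escape(t) for t in tokens
        ) + "$"
    )
    results = []
    for ng in ngrams_sorted[lo:]:
        if prefix and not ng.startswith(prefix):
            break  # left the prefix range; nothing further can match
        m = rx.match(ng)
        if m:
            results.append((ng, m.groups()))
    return results

ngrams = sorted(["in a box", "in the hall", "in the house", "on the wall"])
hits = wildcard_search(ngrams, "in the *")
# → [('in the hall', ('hall',)), ('in the house', ('house',))]
```

A real engine at Web 1T scale would add permuted or positional indexes so that patterns starting with a wildcard also avoid a full scan; this sketch degrades to a linear scan in that case.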
Minimal Perfect Hash Rank: Compact Storage of Large N-gram Language Models
In this paper we propose a new method of compactly storing n-gram language models called Minimal Perfect Hash Rank (MPHR) that uses significantly less space than all known approaches. It requires O(n) construction time and allows for O(1) random access of probability values or frequency counts associated with n-grams. We make use of minimal perfect hashing to store fingerprints of n-grams in an...
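The fingerprint-plus-rank idea above can be illustrated in miniature. A genuine minimal perfect hash function requires an external construction; the sketch below simulates it with an ordinary dict so the lookup logic stays visible. Everything here is illustrative, not the paper's MPHR scheme:

```python
import hashlib

class FingerprintRankTable:
    # Sketch: per n-gram, store only a short fingerprint and a quantized
    # count "rank", both in flat arrays indexed by a minimal perfect
    # hash. The MPH itself is simulated with a dict built offline; a
    # real implementation would use a proper MPH construction instead.
    def __init__(self, counts, fp_bits=16):
        self.fp_bits = fp_bits
        self.mph = {ng: i for i, ng in enumerate(counts)}  # simulated MPH
        # Rank-encode counts over the distinct values (fewer bits each).
        distinct = sorted(set(counts.values()))
        rank_of = {c: r for r, c in enumerate(distinct)}
        self.values = distinct
        self.fp = [0] * len(counts)
        self.rank = [0] * len(counts)
        for ng, c in counts.items():
            i = self.mph[ng]
            self.fp[i] = self._fingerprint(ng)
            self.rank[i] = rank_of[c]

    def _fingerprint(self, ng):
        h = hashlib.blake2b(ng.encode(), digest_size=8).digest()
        return int.from_bytes(h, "big") & ((1 << self.fp_bits) - 1)

    def get(self, ng):
        i = self.mph.get(ng)
        # With a real MPH every key maps to some slot; the fingerprint
        # check rejects unseen n-grams except with probability 2**-fp_bits.
        if i is None or self.fp[i] != self._fingerprint(ng):
            return None
        return self.values[self.rank[i]]

table = FingerprintRankTable({"the cat": 42, "a dog": 7})
```

The space saving comes from never storing the n-gram strings themselves: only the few fingerprint bits and the rank survive, at the cost of a small, tunable false-positive rate on lookups of unseen n-grams.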
Publication date: 2010